Speech Recognition

from Cambridge University studies


This speech processing project focused on identifying phones (distinct sounds) in short audio recordings using a bidirectional LSTM.
A search over the model's hyper-parameter space was conducted, exploring numerous architectural tweaks and training settings. The figure above shows the confusion matrix of a model trained for phone identification: the off-diagonal cells give the frequency of each misclassification, while the main diagonal gives the frequency of correct classifications. The additional INS and DEL columns give the number of insertions and deletions required to convert the predicted sequence of phones into the target, according to the Levenshtein distance.
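
As a rough illustration of how such a matrix could be tallied, the sketch below aligns a predicted phone sequence with its target using the standard Levenshtein dynamic programme, then counts matches, substitutions, insertions and deletions. It is not the code used in the project: the function names `align` and `confusion_counts`, the example phone labels, and the choice to key insertions by the missing target phone and deletions by the surplus predicted phone are all assumptions made for this example.

```python
from collections import Counter


def align(predicted, target):
    """Return (ops, distance) for turning `predicted` into `target`.

    Each op is (kind, predicted_phone, target_phone), where kind is one of
    "match", "sub", "ins" (target phone missing from the prediction) or
    "del" (surplus predicted phone).
    """
    n, m = len(predicted), len(target)
    # dist[i][j] = Levenshtein distance between predicted[:i] and target[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if predicted[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # match or substitution
                dist[i - 1][j] + 1,         # delete a predicted phone
                dist[i][j - 1] + 1,         # insert a target phone
            )

    # Walk back from the bottom-right corner to recover one optimal alignment.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if predicted[i - 1] == target[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                kind = "match" if cost == 0 else "sub"
                ops.append((kind, predicted[i - 1], target[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("del", predicted[i - 1], None))
            i -= 1
        else:
            ops.append(("ins", None, target[j - 1]))
            j -= 1
    ops.reverse()
    return ops, dist[n][m]


def confusion_counts(pairs):
    """Tally (row, column) counts over (predicted, target) sequence pairs.

    Assumed layout: matches and substitutions are keyed (target, predicted);
    insertions are keyed (target, "INS"); deletions are keyed (predicted, "DEL").
    """
    counts = Counter()
    for predicted, target in pairs:
        ops, _ = align(predicted, target)
        for kind, pred_phone, tgt_phone in ops:
            if kind == "ins":
                counts[(tgt_phone, "INS")] += 1
            elif kind == "del":
                counts[(pred_phone, "DEL")] += 1
            else:
                counts[(tgt_phone, pred_phone)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical phone sequences, for illustration only.
    example = [(["k", "ae", "t", "s"], ["k", "ah", "t"])]
    for cell, count in sorted(confusion_counts(example).items()):
        print(cell, count)
```

When several alignments share the same minimal distance, the backtrace here prefers the diagonal step, so ties are resolved in favour of substitutions over an insertion and deletion pair. A different tie-breaking rule would shift counts between the substitution cells and the INS and DEL columns without changing the overall distance.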

Download the Report